AITopics | feature update

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing Systems | Mar-21-2026, 03:13:46 GMT

Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales and learning rates gives only indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general *feature speed formula* in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians and, at random initialization, is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay. We investigate the implications of our approach for ReLU MLPs and ResNets in the large width-then-depth limit. Relying on prior work, we show that in ReLU MLPs with iid initialization, the angle degenerates with depth as $\cos(\theta_\ell)=\Theta(1/\sqrt{\ell})$. In contrast, ResNets with branch scale $O(1/\sqrt{\text{depth}})$ maintain a non-degenerate angle $\cos(\theta_\ell)=\Theta(1)$. We use these insights to recover key properties of known HP scalings (such as $\mu$P), and also introduce a new HP scaling for large depth ReLU MLPs with favorable theoretical properties.
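To make the angle concrete, here is a minimal sketch in plain Python, under stated assumptions: a two-layer linear network, a single training sample, squared loss, and one small GD step. It measures the cosine between the (negated) update of the output features and the backward pass, then checks the first-order chain-rule identity "loss decay ≈ |backward pass| · |feature update| · cos θ". Whether this matches the paper's exact statement and normalization of the feature speed formula is an assumption; the network, the loss, and all function names are illustrative and not taken from the paper.

```python
import math
import random


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def transpose(W):
    return [list(col) for col in zip(*W)]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def forward(W1, W2, x):
    f1 = matvec(W1, x)   # features of layer 1
    f2 = matvec(W2, f1)  # features of layer 2 (network output)
    return f1, f2


def loss(f2, y):
    return 0.5 * sum((a - t) ** 2 for a, t in zip(f2, y))


def gd_step(W, backward, layer_input, lr):
    # For a linear layer, dL/dW is the outer product of the backward
    # pass at the layer output with the layer input.
    return [[w - lr * bi * xj for w, xj in zip(row, layer_input)]
            for row, bi in zip(W, backward)]


def feature_speed_demo(width=8, lr=1e-5, seed=0):
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(width)  # illustrative init scale (assumption)
    W1 = [[rng.gauss(0, scale) for _ in range(width)] for _ in range(width)]
    W2 = [[rng.gauss(0, scale) for _ in range(width)] for _ in range(width)]
    x = [rng.gauss(0, 1) for _ in range(width)]
    y = [rng.gauss(0, 1) for _ in range(width)]

    f1, f2 = forward(W1, W2, x)
    b2 = [a - t for a, t in zip(f2, y)]  # backward pass dL/df2
    b1 = matvec(transpose(W2), b2)       # backward pass dL/df1

    W1_new = gd_step(W1, b1, x, lr)
    W2_new = gd_step(W2, b2, f1, lr)
    _, f2_new = forward(W1_new, W2_new, x)

    df2 = [a - b for a, b in zip(f2_new, f2)]  # feature update at layer 2
    cos_theta = dot([-d for d in df2], b2) / (norm(df2) * norm(b2))
    loss_decay = loss(f2, y) - loss(f2_new, y)

    # First-order identity: loss_decay ~ |b2| * |df2| * cos(theta),
    # so the update magnitude can be read off from the other three.
    predicted_speed = loss_decay / (norm(b2) * cos_theta)
    return cos_theta, norm(df2), predicted_speed


cos_theta, measured, predicted = feature_speed_demo()
print(f"cos(theta) = {cos_theta:.4f}")
print(f"measured |df| = {measured:.6e}, predicted = {predicted:.6e}")
```

In this linear toy the update of the output features is, to first order, a positive-definite matrix built from $W_2 W_2^\top$ applied to the backward pass, which is one way to see why the conditioning of layer-to-layer Jacobians affects the angle.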

artificial intelligence, machine learning, proceedings, (12 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.38)

7f5f9a88c6516469c83d074c6f2976fb-Paper-Conference.pdf

Neural Information Processing Systems | Feb-16-2026, 03:17:13 GMT

artificial intelligence, machine learning, slow-learnable feature, (17 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.92)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
Information Technology > Data Science (0.67)

d4387c37b3b06e55f86eccdb8cd1f829-Paper-Conference.pdf

Neural Information Processing Systems | Nov-20-2025, 04:41:11 GMT

arxiv preprint arxiv, machine learning, natural language, (15 more...)

Neural Information Processing Systems

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

The Impact of Initialization on LoRA Finetuning Dynamics

Neural Information Processing Systems | Oct-11-2025, 00:43:00 GMT

In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. [19]. Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice versa. In both cases, the product BA is equal to zero at initialization, so finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance than the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) than the second initialization, resulting in more efficient learning with the first scheme.
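The two schemes can be sketched in a few lines of plain Python to confirm that both start from a zero update BA. This is an illustration only: the function names, the matrix shapes, and the Gaussian scale for the random factor are assumptions, and the code does not call the PEFT package.

```python
import math
import random


def lora_init(m, n, r, scheme, seed=0):
    """Return LoRA factors A (r x n) and B (m x r) for one of two schemes.

    scheme="B_zero": B = 0, A random (the first scheme in the abstract).
    scheme="A_zero": A = 0, B random (the second scheme).
    The 1/sqrt(fan_in) Gaussian scale is an illustrative assumption.
    """
    rng = random.Random(seed)

    def rand(rows, cols):
        std = 1.0 / math.sqrt(cols)
        return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

    def zeros(rows, cols):
        return [[0.0] * cols for _ in range(rows)]

    if scheme == "B_zero":
        return rand(r, n), zeros(m, r)
    if scheme == "A_zero":
        return zeros(r, n), rand(m, r)
    raise ValueError(f"unknown scheme: {scheme}")


def matmul(B, A):
    inner, cols = len(A), len(A[0])
    return [[sum(row[k] * A[k][j] for k in range(inner)) for j in range(cols)]
            for row in B]


for scheme in ("B_zero", "A_zero"):
    A, B = lora_init(m=6, n=5, r=2, scheme=scheme)
    delta = matmul(B, A)  # the low-rank update added to the pretrained weight
    is_zero = all(v == 0.0 for row in delta for v in row)
    print(scheme, "-> BA is zero at init:", is_zero)
```

Both schemes leave the pretrained weights untouched at step zero; the abstract's point is that they differ in the training dynamics that follow, which this sketch does not simulate.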

arxiv preprint arxiv, initialization, optimal, (12 more...)

Neural Information Processing Systems

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (0.68)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Changing the Training Data Distribution to Reduce Simplicity Bias Improves In-distribution Generalization

Neural Information Processing Systems | Oct-10-2025, 07:26:02 GMT

By studying a two-layer CNN, we rigorously prove that SAM learns different features more uniformly, particularly in early epochs.

dataset, fast-learnable feature, slow-learnable feature, (15 more...)

Neural Information Processing Systems

Country:

North America > United States (0.14)
Asia > Middle East > Jordan (0.04)

Genre: Research Report > Experimental Study (0.92)

Industry: Information Technology (0.45)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Kerboua, Imene, Shayegan, Sahar Omidi, Thakkar, Megh, Lù, Xing Han, Boisvert, Léo, Caccia, Massimo, Espinas, Jérémy, Aussem, Alexandre, Eglin, Véronique, Lacoste, Alexandre

arXiv.org Artificial Intelligence | Oct-6-2025

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.
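The line-level pruning idea can be sketched as follows. Note the substitution: FocusAgent uses a lightweight LLM as the retriever, whereas this toy uses keyword overlap with the goal as a stand-in scorer so the example stays self-contained. The AxTree line format, the function names, and the `keep` parameter are all hypothetical.

```python
import re


def tokenize(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def prune_observation(axtree_lines, goal, keep=3):
    """Keep at most `keep` lines that share tokens with the goal.

    Keyword overlap stands in for the LLM retriever described in the
    abstract; lines with no overlap are always dropped, and the kept
    lines are returned in their original order.
    """
    goal_tokens = tokenize(goal)
    scored = [(len(goal_tokens & tokenize(line)), i)
              for i, line in enumerate(axtree_lines)]
    top = sorted(scored, key=lambda s: (-s[0], s[1]))[:keep]
    kept = sorted(i for score, i in top if score > 0)
    return [axtree_lines[i] for i in kept]


# Hypothetical AxTree-style observation, one element per line.
observation = [
    "[1] banner 'Win a free prize, click here'",
    "[2] textbox 'Search orders'",
    "[3] button 'Submit order'",
    "[4] link 'Privacy policy'",
]
for line in prune_observation(observation, "Submit the order form", keep=2):
    print(line)
```

A scorer this crude misses near matches such as "orders" versus "order", which is one reason a learned retriever is the more plausible choice in practice.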

artificial intelligence, large language model, natural language, (19 more...)

arXiv.org Artificial Intelligence

2510.03204

Country: North America > Canada > Quebec (0.28)

Genre: Research Report > New Finding (0.66)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Communications > Web (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)

The Feature Speed Formula: a flexible approach to scale hyper-parameters of deep neural networks

Neural Information Processing Systems | May-27-2025, 05:12:30 GMT

Deep learning succeeds by doing hierarchical feature learning, yet tuning hyper-parameters (HP) such as initialization scales and learning rates gives only indirect control over this behavior. In this paper, we introduce a key notion to predict and control feature learning: the angle $\theta_\ell$ between the feature updates and the backward pass (at layer index $\ell$). We show that the magnitude of feature updates after one GD step, at any training time, can be expressed via a simple and general *feature speed formula* in terms of this angle $\theta_\ell$, the loss decay, and the magnitude of the backward pass. This angle $\theta_\ell$ is controlled by the conditioning of the layer-to-layer Jacobians and, at random initialization, is determined by the spectrum of a certain kernel, which coincides with the Neural Tangent Kernel when $\ell=\text{depth}$. Given $\theta_\ell$, the feature speed formula provides us with rules to adjust HPs (scales and learning rates) so as to satisfy certain dynamical properties, such as feature learning and loss decay.

artificial intelligence, machine learning, scale hyper-parameter, (13 more...)

Neural Information Processing Systems

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.73)

The Impact of Initialization on LoRA Finetuning Dynamics

Hayou, Soufiane, Ghosh, Nikhil, Yu, Bin

arXiv.org Machine Learning | Jun-12-2024

In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model as initialization for finetuning, one can either initialize B to zero and A to random (default initialization in PEFT package), or vice versa. In both cases, the product BA is equal to zero at initialization, so finetuning starts from the pretrained model. These two initialization schemes are seemingly similar. They should in principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an incorrect intuition and that the first scheme (initializing B to zero and A to random) on average yields better performance than the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) than the second initialization, resulting in more efficient learning with the first scheme. We validate our results with extensive experiments on LLMs.

arxiv preprint arxiv, initialization scheme, optimal, (10 more...)

arXiv.org Machine Learning

2406.08447

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report > New Finding (0.66)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.66)

Make the Most of Your Data: Changing the Training Data Distribution to Improve In-distribution Generalization Performance

Nguyen, Dang, Haddad, Paymon, Gan, Eric, Mirzasoleiman, Baharan

arXiv.org Artificial Intelligence | Apr-26-2024

Can we modify the training data distribution to encourage the underlying optimization method toward finding solutions with superior generalization performance on in-distribution data? In this work, we approach this question for the first time by comparing the inductive bias of gradient descent (GD) with that of sharpness-aware minimization (SAM). By studying a two-layer CNN, we prove that SAM learns easy and difficult features more uniformly, particularly in early epochs. That is, SAM is less susceptible to simplicity bias compared to GD. Based on this observation, we propose USEFUL, an algorithm that clusters examples based on the network output early in training and upsamples examples with no easy features to alleviate the pitfalls of the simplicity bias. We show empirically that modifying the training data distribution in this way effectively improves the generalization performance on the original data distribution when training with (S)GD by mimicking the training dynamics of SAM. Notably, we demonstrate that our method can be combined with SAM and existing data augmentation strategies to achieve, to the best of our knowledge, state-of-the-art performance for training ResNet18 on CIFAR10, STL10, CINIC10, Tiny-ImageNet; ResNet34 on CIFAR100; and VGG19 and DenseNet121 on CIFAR10.
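The cluster-then-upsample step can be illustrated with a rough sketch. The assumptions are substantial: each example is summarized by a single early-training confidence score, a 1-D two-means split stands in for the clustering, and the low-confidence cluster is treated as the "no easy features" group. None of these details come from the abstract, and the function names are hypothetical.

```python
def two_means_threshold(values, iters=20):
    """Split 1-D values into two clusters and return the midpoint of the centers."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all values collapsed into one cluster
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return (lo + hi) / 2.0


def upsample_hard_examples(indices, early_confidence, copies=1):
    """Repeat examples whose early confidence falls in the lower cluster.

    `early_confidence` is assumed to be the model's confidence on the
    correct class after a few epochs; low values are taken as a proxy
    for examples that lack easy features.
    """
    threshold = two_means_threshold(early_confidence)
    resampled = []
    for idx, conf in zip(indices, early_confidence):
        resampled.append(idx)
        if conf < threshold:
            resampled.extend([idx] * copies)
    return resampled


print(upsample_hard_examples([0, 1, 2, 3], [0.9, 0.85, 0.2, 0.95]))
```

Training would then restart on the resampled index list with an ordinary (S)GD loop, which is where the abstract claims the benefit appears.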

dataset, difficult feature, training data distribution, (12 more...)

arXiv.org Artificial Intelligence

2404.17768

Country: North America > United States > California > Los Angeles County > Los Angeles (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)

LoRA+: Efficient Low Rank Adaptation of Large Models

Hayou, Soufiane, Ghosh, Nikhil, Yu, Bin

arXiv.org Machine Learning | Feb-19-2024

In this paper, we show that Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021) leads to suboptimal finetuning of models with large width (embedding dimension). This is due to the fact that adapter matrices A and B in LoRA are updated with the same learning rate. Using scaling arguments for large width networks, we demonstrate that using the same learning rate for A and B does not allow efficient feature learning. We then show that this suboptimality of LoRA can be corrected simply by setting different learning rates for the LoRA adapter matrices A and B with a well-chosen ratio. We call this proposed algorithm LoRA+. In our extensive experiments, LoRA+ improves performance (1-2% improvements) and finetuning speed (up to ~2x speedup), at the same computational cost as LoRA.
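The core change, separate learning rates for A and B tied by a ratio, can be sketched with a single update step. Plain SGD is used here for brevity, which is an assumption about the optimizer, and the `ratio` value below is a placeholder since the abstract does not state it. The function name is hypothetical.

```python
def lora_plus_sgd_step(A, B, grad_A, grad_B, lr_a, ratio):
    """One SGD step with lr_a for A and lr_b = ratio * lr_a for B."""
    lr_b = ratio * lr_a
    new_A = [[a - lr_a * g for a, g in zip(row, grow)]
             for row, grow in zip(A, grad_A)]
    new_B = [[b - lr_b * g for b, g in zip(row, grow)]
             for row, grow in zip(B, grad_B)]
    return new_A, new_B


# Tiny illustrative factors: A is 1 x 2, B is 2 x 1 (rank 1).
A = [[1.0, 2.0]]
B = [[1.0], [1.0]]
grad_A = [[1.0, 1.0]]
grad_B = [[1.0], [0.5]]

new_A, new_B = lora_plus_sgd_step(A, B, grad_A, grad_B, lr_a=0.5, ratio=4.0)
print(new_A)
print(new_B)
```

Setting `ratio=1.0` recovers the standard LoRA update in which both matrices share one learning rate.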

arxiv preprint arxiv, lora, lora layer, (14 more...)

arXiv.org Machine Learning

2402.12354

Country: Europe > Romania > Sud - Muntenia Development Region > Giurgiu County > Giurgiu (0.04)

Genre: Research Report (0.63)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.94)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
